Abstract. Shrinkage estimation has become a basic tool in the analysis
of high-dimensional data. Historically and conceptually a key development
toward this was the discovery of the inadmissibility of the usual
estimator of a multivariate normal mean.

This article develops a geometrical explanation for this inadmissibil- 
ity. By exploiting the spherical symmetry of the problem it is possi- 
ble to effectively conceptualize the multidimensional setting in a two- 
dimensional framework that can be easily plotted and geometrically an- 
alyzed. We begin with the heuristic explanation for inadmissibility that 
was given by Stein [In Proceedings of the Third Berkeley Symposium 
on Mathematical Statistics and Probability, 1954-1955, Vol. I (1956) 
197-206, Univ. California Press]. Some geometric figures are included 
to make this reasoning more tangible. It is also explained why Stein's 
argument falls short of yielding a proof of inadmissibility, even when 
the dimension, p, is much larger than p=3. 

We then extend the geometric idea to yield increasingly persuasive
arguments for inadmissibility when $p \ge 3$, albeit at the cost of
increased geometric and computational detail.

Key words and phrases: Stein estimation, shrinkage, minimax, empir- 
ical Bayes, high-dimensional geometry. 



1. INTRODUCTION 

More than 50 years ago Stein (1956) published his classic paper,
"Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution." The title result is probably the most startling
statistical discovery of the past century. Erich Lehmann, who also
worked on the admissibility question, more recently described how he
was "stunned with disbelief" when Charles first told him of this result
(personal communication). Following the initial discovery James and
Stein (1961) presented their well-known shrinkage estimator that
provides numerically significant improvement of risk relative to that
of the usual estimator.

[Hodges and Lehmann (1951) and Girshick and 
Savage (1951) had earlier provided proofs of admis- 
sibility in the unidimensional problem; Lehmann's 
student Blyth (1951) had published another, more 
general, argument for this same fact; and Lehmann 
and Stein (1953) had produced a proof of admissibil- 
ity in a related one-dimensional hypothesis testing 
setting.] 

Stein (1956) begins by describing the multivariate problem and then
gives a heuristic, geometric argument intended to convince the reader
that the usual estimator should be inadmissible if the dimension is
sufficiently large. The core of this argument will be repeated below,
with some additional illustrations that hopefully help to clarify the
situation. The argument given by Stein provides insight into why
inadmissibility occurs in very high-dimensional problems. But it does
not provide a rationale for the fact that 3 is the critical dimension:
admissibility holds in dimensions 1 and 2 but not in three or more
dimensions. [Section 4 of Stein (1956) contains an admissibility proof
for two dimensions. See also Brown (1971) and Brown and Fox (1974).]

The argument in the following note expands Stein's 
original heuristic idea, clarifies the geometry, and 
provides justification for the fact that 3 is the critical 
dimension. The argument is based on plane geome- 
try and some simple "back-of-the-envelope" Taylor 
series expansions. As with Stein's argument, what 
is given here is not a proof. It could undoubtedly be 
expanded into a proof, but without further insight 
that proof would likely be similar to — and perhaps 
harder than — the existing inadmissibility proofs in 
Stein (1956) and Brown (1966). A slightly different 
geometrically based argument is suggested in Stein 
(1962) and is additionally expanded in Brandwein 
and Strawderman (1990). This argument is men- 
tioned in Section 3. 

Versions of this argument were presented in the 
1960s in oral form independently by L. Brown, by B. 
Efron, and perhaps by others. But so far as we know 
the argument here does not appear in print. In addi- 
tion, we feel it is worthwhile to remind readers of the 
geometric rationale underpinning Stein shrinkage in 
a form that displays that 3 is the critical dimension. 

2. THE ADMISSIBILITY PROBLEM 

Let $X = (X_1, \ldots, X_p)'$ where $X_i$, $i = 1, \ldots, p$, are
independent normal variables with unknown means
$\theta_1, \ldots, \theta_p$ and all with the same known variance,
$\sigma^2$. Without loss of generality, assume $\sigma^2 = 1$. It is
desired to estimate $\theta = (\theta_1, \ldots, \theta_p)'$ with the
quality of an estimate being measured through squared error loss,
$L(d, \theta) = \|d - \theta\|^2 = \sum (d_i - \theta_i)^2$. Let
$d = \delta(X)$ denote an estimator. The risk function of $\delta$ is
denoted by $R(\theta; \delta) = E_\theta(L(\delta(X), \theta))$.

The "usual" estimator of $\theta$ is $X$ itself, that is,
$\delta_0(X) = X$. This estimator is intuitive and has several
appealing formal properties such as minimaxity, best-invariance,
maximum likelihood, etc. [See standard textbooks such as Lehmann and
Casella (1998) for discussion of these properties.]

Prior to Stein (1956) it had been firmly conjectured that $\delta_0$ is
admissible for any value of $p$. Admissibility means that there is no
other estimator that is better in the sense of risk; formally, that
there is no estimator $\delta'$ such that
$R(\theta; \delta') \le R(\theta; \delta_0)$ with strict inequality at
some value of $\theta$. [Actually, though it is not important in the
sequel, we note that a well-known supplementary argument shows that
$\delta_0$ is inadmissible if and only if there is another estimator
that is always strictly better in the sense that
$R(\theta; \delta') < R(\theta; \delta_0)$ for all $\theta$.]

What Stein proved in Sections 2-4 of Stein (1956) is:

Theorem (Stein). $\delta_0$ is admissible if and only if $p \le 2$.

Our goal is to explain why $\delta_0$ is inadmissible when $p \ge 3$.

3. SPHERICAL SYMMETRY 

A spherically symmetric estimator is one that satisfies

$$\delta(X) = \tau(\|X\|)\,X \tag{1}$$

for some scalar function, $\tau$. Of course, $\delta_0$ is spherically
symmetric. We confine the search for alternatives to $\delta_0$ to the
collection of spherically symmetric estimators. Geometrically, these
are estimators that lie on the line through the origin and $X$, and
whose distance from the origin depends on $\|X\|$. Such an estimator is
given as in (1)-(3).

The restriction to spherically symmetric alternatives is intuitively
plausible. To support this intuition, Stein (1956), Section 3, contains
a formal proof that $\delta_0$ is inadmissible if and only if there is a
spherically symmetric estimator which is better.

Once one has decided to restrict consideration only to spherically
symmetric estimators it is possible to correctly plot and study the
multivariate problem in a two-dimensional coordinate framework for the
sample space. One coordinate measures the sample in the direction of
the true parameter, $\theta$; the other coordinate is the length of the
orthogonal residual from this direction. This leads to the geometric
picture developed in the following section.

4. GEOMETRY FOR SPHERICALLY 
SYMMETRIC ESTIMATORS 

Only spherically symmetric estimators need to be considered. For such
estimators relevant distributions depend only on the magnitude of
$\theta$; the direction of the vector $\theta$ does not matter.
Formally, this means that after the constraint to spherically symmetric
estimators it suffices to consider the situation when $\theta$ lies on
the first coordinate axis. So, assume
$\theta = (\vartheta, 0, \ldots, 0)'$. Let $X = (X_1, X_{(2)}')'$ where
$X_{(2)} \in \mathbb{R}^{p-1}$. Geometrically, $X_{(2)}$ is the residual
of $X$ after projection on the direction determined by $\theta$. Again,
only the length of $X_{(2)}$ matters, not its direction in the
hyperplane perpendicular to $\theta$. Hence, let $R = \|X_{(2)}\|$. The
relevant statistics for the observed sample can thus be rewritten as

$$Z = (X_1, R) \quad \text{with } X_1 \sim N(\vartheta, 1),\ R^2 \sim \chi^2_{p-1},\ \text{and } X_1, R \text{ independent.} \tag{2}$$

Fig. 1. A typical observation in the $Z = (X_1, R)$ coordinate system.



Spherically symmetric estimators as in (1) are expressed similarly in
the $Z$ coordinate system as

$$\delta(Z) = \tau(\|Z\|)\,Z. \tag{3}$$

The $Z$ coordinate system is two-dimensional. Hence it can be
conveniently visualized geometrically. A key feature of the
transformation leading from the original, $X$, system to the $Z$ system
is that distances are preserved. In particular, for spherically
symmetric estimators

$$\|\delta(X) - \theta\| = \|\delta(Z) - (\vartheta, 0)\|.$$

Thus the squared error risks are the same in the two problems.
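The distance-preservation claim can be checked numerically. The following is a minimal sketch and not part of the original argument: the helper names (`to_z`, `shrink`, `sq_dist`) and the particular multiplier `tau` are assumptions chosen only for illustration. It draws one observation with the parameter on the first axis, reduces it to the two-dimensional coordinates, applies the same scalar multiplier in both systems, and compares the two squared errors.

```python
import math
import random

def to_z(x):
    """Reduce a p-vector x to Z = (x_1, R), with R the length of the other coordinates."""
    r = math.sqrt(sum(v * v for v in x[1:]))
    return (x[0], r)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def shrink(v, tau):
    """Spherically symmetric estimate tau(||v||) * v."""
    t = tau(norm(v))
    return [t * c for c in v]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def tau(length):
    """Hypothetical multiplier, chosen only for illustration."""
    return 1.0 - 1.0 / (1.0 + length * length)

random.seed(0)
p, vartheta = 8, 3.0
theta = [vartheta] + [0.0] * (p - 1)
x = [random.gauss(m, 1.0) for m in theta]
z = to_z(x)

loss_x = sq_dist(shrink(x, tau), theta)                  # loss in the p-dimensional problem
loss_z = sq_dist(shrink(list(z), tau), (vartheta, 0.0))  # loss in the planar problem
print(loss_x, loss_z)  # the two losses agree up to rounding
```

The agreement holds for any multiplier that depends on the observation only through its length, since the reduction leaves that length unchanged.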

Pictorially this can be plotted in standard planar coordinates, as
pictured in Figure 1. Figure 1 shows a typical observation of $Z$ in
the $(X_1, R)$ coordinate system. It also represents a spherically
symmetric estimate corresponding to $Z$, as given by formulas (1)-(3).
Pay special attention to the fact that this estimator is on the line
through $Z$. Figure 1 also shows an additional point
$\xi = (\xi_1, \xi_2) = (\vartheta, \sqrt{p-1})$. This represents the
intuitive "center" of the distribution of $Z$.

Fig. 2. 2000 observations of $Z$ in the case $p = 20$ and $\vartheta = 25$.

In terms of Figure 1 the statistical situation can be summarized as
follows: You observe $Z$ with distribution as specified above. You are
constrained to use only spherically symmetric estimators that lie on
the line from the origin through $Z$, as shown in the plot. You want to
find an estimator that is close to $\theta$ in terms of squared
distance. For the point shown on the plot it is fairly clear that there
are spherically symmetric estimates that are better than just $Z$
alone. The point $\delta$ shown on the plot is one such better
estimate. The goal of the remainder of the paper is to substantiate
that situations like that in the figure are on average sufficiently
typical (at least when $p \ge 3$), and hence that appropriate shrinkage
estimators are better than $Z$ itself.

[Note that $p - 1 = E(R^2)$. Hence it makes sense to think of
$\sqrt{p-1} = \xi_2$ as the center of the distribution of $R$. This is
not exactly either the mean or median of $R$, but it is sufficiently
close and is convenient for the following discussion. The exact mean of
$R$ is $E(R) = \sqrt{2}\,\Gamma(p/2)/\Gamma((p-1)/2)$. For
$p = 5, 10, 17, 26$, respectively, this takes the values
$E(R) = 1.880, 2.918, 3.938, 4.950$ as compared to the values
$\xi_2 = \sqrt{p-1} = 2, 3, 4, 5$. Asymptotically,
$E(R) = \sqrt{p-1} - 1/(4\sqrt{p-1}) + O((p-1)^{-3/2})$.]
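The exact mean of $R$ and its two-term approximation are easy to tabulate. The sketch below assumes only that $R^2$ is chi-squared with $p - 1$ degrees of freedom; the function names are illustrative and do not come from the paper.

```python
import math

def mean_chi(p):
    """Exact E(R) when R^2 is chi-squared with p - 1 degrees of freedom."""
    # Work with log-gamma so the ratio stays stable for large p.
    return math.sqrt(2.0) * math.exp(math.lgamma(p / 2.0) - math.lgamma((p - 1) / 2.0))

def mean_chi_two_term(p):
    """Two-term approximation sqrt(p - 1) - 1 / (4 sqrt(p - 1))."""
    s = math.sqrt(p - 1)
    return s - 1.0 / (4.0 * s)

for p in (5, 10, 17, 26):
    # columns: p, exact mean, two-term approximation, sqrt(p - 1)
    print(p, round(mean_chi(p), 3), round(mean_chi_two_term(p), 3), math.sqrt(p - 1))
```

In each case the exact mean falls slightly below $\sqrt{p-1}$, which is why $\sqrt{p-1}$ serves as a convenient, if not exact, center.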

Figure 2 shows a typical sample of 2000 observations of $Z$ in the case
$p = 20$ and $\vartheta = 25$. The dominant feature is that the sample
points are moderately tightly clustered about $\xi = (25, \sqrt{19})$
and hence are much closer to $\xi$ than they are to the parameter point
$\theta = (\vartheta, 0)$.
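A sample like the one in Figure 2 can be regenerated along the following lines. This is only a sketch: the seed and the helper name `sample_z` are arbitrary choices, so the particular draws will differ from those plotted in the figure. It compares the average distance of the simulated points to the center of the cloud with their average distance to the parameter point.

```python
import math
import random

def sample_z(p, vartheta, rng):
    """Draw Z = (X_1, R): X_1 ~ N(vartheta, 1), R^2 chi-squared with p - 1 d.f."""
    x1 = rng.gauss(vartheta, 1.0)
    r = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p - 1)))
    return x1, r

rng = random.Random(2)              # arbitrary seed, for reproducibility only
p, vartheta, n = 20, 25.0, 2000
xi = (vartheta, math.sqrt(p - 1))   # intuitive center of the cloud
theta = (vartheta, 0.0)             # parameter point

draws = [sample_z(p, vartheta, rng) for _ in range(n)]
dist_to_xi = [math.dist(z, xi) for z in draws]
dist_to_theta = [math.dist(z, theta) for z in draws]

mean_to_xi = sum(dist_to_xi) / n
mean_to_theta = sum(dist_to_theta) / n
share_closer_to_xi = sum(a < b for a, b in zip(dist_to_xi, dist_to_theta)) / n
print(mean_to_xi, mean_to_theta, share_closer_to_xi)
```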

5. STEIN'S HEURISTIC ARGUMENT 

It is fairly clear from pictures like Figure 2 that shrinking the
observations somewhat toward the origin will often bring the estimator
closer to the true mean $\theta = (\vartheta, 0)$. Even more striking:
consider what happens in a plot like Figure 2 as $p \to \infty$ for
fixed $\theta = (\vartheta, 0)$. Then the cloud of points moves
vertically upward. Eventually, virtually the entire cloud lies outside
the circle of radius $\|\theta\|$. To be more precise

$$\|X\|^2 = \|\theta\|^2 + p + O_p(\sqrt{p}) \tag{4}$$

as $p \to \infty$ for any fixed $\theta$. This asymptotic fact can be
derived from the non-central chi-squared distribution of $\|X\|^2$ or
from a simple Taylor approximation as is done in Stein's heuristic
argument. Viewed another way, (4) says that

$$\frac{\|\theta\|^2}{\|X\|^2} = \frac{\|X\|^2 - p - O_p(\sqrt{p})}{\|X\|^2} = 1 - \frac{p + O_p(\sqrt{p})}{\|X\|^2}. \tag{5}$$

Fig. 3. Geometry of the naive optimal estimator: it shows the origin,
O, the points $A = (\vartheta, 0)$, $B = (\vartheta, \sqrt{p-1})$ and C,
the projection of A on the line OB.

Any observation that lies outside of the sphere of radius $\|\theta\|$
can be brought closer to $\theta$ by shrinking it toward the origin so
as to lie on the sphere. (Actually, somewhat more shrinkage is
desirable as will be clear from the discussion of Figure 3, below.)
This suggests that shrinkage by a factor $\approx \|\theta\|/\|X\|$
should be desirable as $p \to \infty$. The argument in Stein (1956)
elaborates a little further and shows with a Taylor expansion that
shrinkage by a factor $\|\theta\|^2/\|X\|^2$ is still advantageous as
$p \to \infty$ for any fixed $\theta$. This motivates the use of the
estimator $\delta_p(X) = (1 - p/\|X\|^2)X$. This is related to what is
used in Stein (1956) to prove inadmissibility of the usual estimator.
The James-Stein (1961) estimator $\delta_{p-2}$ is better than the
usual one when $p \ge 3$, as proved in that paper and later in a more
efficient manner through Stein's unbiased estimate of risk in Stein
(1973, 1981).



[Since $p \to \infty$ the difference between the factor $p/\|X\|^2$ in
this argument and the factor $(p - 2)/\|X\|^2$ in James and Stein
(1961) is irrelevant. For fixed $p$ it can be shown by the arguments
mentioned above that $\delta_p$ dominates $\delta_0$ whenever $p > 4$.]
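The risk comparison between the usual estimator and shrinkage estimators of the form $(1 - C/\|X\|^2)X$ can be explored by simulation. The sketch below is a rough Monte Carlo illustration, not a proof; the seed, the sample size, the choice $p = 10$ and the parameter point are all arbitrary assumptions, and the function names are hypothetical.

```python
import random

def shrink(x, c):
    """(1 - c / ||x||^2) x; c = 0 gives the usual estimator."""
    s = sum(v * v for v in x)
    factor = 1.0 - c / s
    return [factor * v for v in x]

def mc_risk(p, vartheta, c, n, rng):
    """Monte Carlo estimate of the squared error risk at theta = (vartheta, 0, ..., 0)."""
    theta = [vartheta] + [0.0] * (p - 1)
    total = 0.0
    for _ in range(n):
        x = [rng.gauss(m, 1.0) for m in theta]
        est = shrink(x, c)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / n

rng = random.Random(1)   # arbitrary seed
p, vartheta, n = 10, 3.0, 20000
risk_usual = mc_risk(p, vartheta, 0.0, n, rng)        # usual estimator, risk should be near p
risk_p = mc_risk(p, vartheta, float(p), n, rng)       # shrinkage constant C = p
risk_js = mc_risk(p, vartheta, float(p - 2), n, rng)  # James-Stein constant C = p - 2
print(risk_usual, risk_p, risk_js)
```

With these settings both shrinkage estimators show a visibly smaller simulated risk than the usual estimator; other parameter points can be tried by changing `vartheta`.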

Stein (1956) writes that "With some additional precision this
[heuristic argument] could be made ... [in]to ... a proof that for
sufficiently large [$p$] the usual estimator is inadmissible." This is
the type of exaggeration that may be excused by the above being only
meant as a heuristic argument. In fact much more than "some additional
precision" is needed to prove the usual estimator is inadmissible for
sufficiently large $p$. The reason that the above does not easily yield
a proof of inadmissibility is that it only holds for any fixed $\theta$
as $p \to \infty$. It does not hold uniformly in $\theta$, but a
uniform argument is needed in order to prove inadmissibility.

To be more precise, for any $p$ no matter how large,
$\inf_\theta \{P_\theta(\|X\| > \|\theta\|)\} = 1/2$, rather than
approaching 1 as is implicitly suggested within the heuristic argument,
and as would be needed to easily convert the heuristic argument into a
proof.
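The lack of uniformity can be seen in a small simulation. The sketch below estimates the probability that the observation falls outside the sphere through the parameter, first for a fixed parameter and growing dimension, then for a fixed dimension and growing parameter length. The seed, sample size and grid of values are arbitrary assumptions.

```python
import random

def prob_outside(p, vartheta, n, rng):
    """Monte Carlo estimate of P(||X|| > ||theta||) at theta = (vartheta, 0, ..., 0)."""
    hits = 0
    for _ in range(n):
        x1 = rng.gauss(vartheta, 1.0)
        rest = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p - 1))
        if x1 * x1 + rest > vartheta * vartheta:
            hits += 1
    return hits / n

rng = random.Random(3)   # arbitrary seed
n = 5000

# Fixed theta, growing p: the estimated probability climbs toward 1.
for p in (2, 20, 100):
    print("p =", p, "vartheta = 5:", prob_outside(p, 5.0, n, rng))

# Fixed p, growing ||theta||: the estimated probability falls back toward 1/2.
for vartheta in (5.0, 50.0, 5000.0):
    print("p = 20, vartheta =", vartheta, ":", prob_outside(20, vartheta, n, rng))
```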

Hence a more elaborate argument is needed to 
prove that the usual estimator is inadmissible. The 
following discussion presents a heuristic argument 
for inadmissibility that is consistent with the geo- 
metric insight in Stein's motivation. 

6. DESIRED AMOUNT OF SHRINKAGE; 
TYPICAL OBSERVATION 

Figures 1 and 2 show that the observations are close to
$\xi = (\vartheta, \sqrt{p-1})$, whereas the estimate should be as close
as possible to $\theta = (\vartheta, 0)$. Figure 3 illustrates the
geometry of this situation when $Z = (\vartheta, \sqrt{p-1})$. It shows
the origin (O), the point $A = \theta = (\vartheta, 0)$ which is the
desired target of the estimate, and the point
$B = \xi = (\vartheta, \sqrt{p-1})$ which is a typical observation. For
such an observation any spherically symmetric estimator must be on the
line OB. The point C in Figure 3 is the point on that line which is
closest to the desired target, A. Similar triangles yield that

$$\frac{|AB|}{|OB|} = \frac{|BC|}{|AB|},$$

where $|AB|$ denotes the length of the segment AB, etc. Simplifying
yields

$$|BC| = \frac{|AB|^2}{|OB|} = \frac{p-1}{\|\xi\|}. \tag{6}$$

The point C is the best estimate based on an observation at $B = \xi$.
By (6) it can be written as

$$C = \left(1 - \frac{|BC|}{|OB|}\right) B = \left(1 - \frac{p-1}{\|\xi\|^2}\right)\xi.$$
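The plane geometry of the points O, A, B and C can be verified directly. The sketch below computes C as the orthogonal projection of A on the line OB and compares it with the shrinkage form $(1 - (p-1)/\|B\|^2)B$. The values of $p$ and $\vartheta$ are arbitrary and the helper name is illustrative.

```python
import math

def project_onto_line_through_origin(a, b):
    """Orthogonal projection of the point a onto the line through the origin and b."""
    scale = (a[0] * b[0] + a[1] * b[1]) / (b[0] ** 2 + b[1] ** 2)
    return (scale * b[0], scale * b[1])

p, vartheta = 20, 25.0
A = (vartheta, 0.0)                # target of the estimate
B = (vartheta, math.sqrt(p - 1))   # typical observation
C = project_onto_line_through_origin(A, B)

len_OB = math.hypot(*B)
len_AB = math.dist(A, B)
len_BC = math.dist(B, C)

# Shrinkage form of C: multiply B by 1 - (p - 1) / ||B||^2.
factor = 1.0 - (p - 1) / len_OB ** 2
C_from_formula = (factor * B[0], factor * B[1])
print(C, C_from_formula, len_BC, len_AB ** 2 / len_OB)
```

The projection and the shrinkage form give the same point because the projection coefficient is $\vartheta^2/(\vartheta^2 + p - 1)$, which equals $1 - (p-1)/\|B\|^2$.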



By comparison with (1), this suggests that the optimal spherically
symmetric estimator will be the Naive Geometrically Optimal estimator

$$\delta_{\mathrm{NGO}}(Z) = \left(1 - \frac{p-1}{\|Z\|^2}\right) Z. \tag{7}$$

The discussion leading to (7) suggests that

$$\delta_{\mathrm{NGO}}(X) = \left(1 - \frac{p-1}{\|X\|^2}\right) X$$

should dominate $\delta_0$. The above motivation and construction of
$\delta_{\mathrm{NGO}}$ does not suffer from the defect noted above in
Stein's original heuristics: it does not require $p \to \infty$ for
each fixed $\vartheta$. However, it suggests that $\delta_0$ is
inadmissible even for $p = 2$. This suggestion is not correct; and so a
more careful heuristic argument is needed to get a better description
of the relevant geometry.

7. STOCHASTIC VARIATION 

The estimator $\delta_{\mathrm{NGO}}$ in (7) is only optimal at
$\xi = (\vartheta, \sqrt{p-1})$, the central point of the distribution
of $Z$. Of course, $Z$ is not identically $\xi$, but is only
stochastically close to $\xi$. The calculation leading to (7) is only
approximate, not exact. There is a small price in accuracy to be paid
in order to accommodate the stochastic variation of $Z$. In order to
better understand the composition of this price consider a particular
pair of equally likely possible points for $Z$. These points are
labeled $\xi_+, \xi_-$ in Figure 4. They are defined as

$$\xi_\pm = (\vartheta \pm 1, \sqrt{p-1}).$$

These points exhibit typical stochastic variation in the direction of
$\theta = (\vartheta, 0)$ since their mean and mean squared distance in
that direction match those of the full distribution. While they do not
accurately model the stochastic variation in the direction orthogonal
to $\theta = (\vartheta, 0)$, it turns out that this additional
variability is only of secondary importance. Thus, we will ignore the
effect of this orthogonal variation for now. It becomes clear from the
exact expression discussed later at (14)-(15) that the orthogonal
variation is indeed of secondary importance in calculation of the
difference in risks.

Fig. 4. The values of $\xi_\pm$ and their respective estimates. Note
that $L_+ < \|\xi_+ - \theta\|^2 = \|\xi_- - \theta\|^2$ but $L_-$ can
be $> \|\xi_- - \theta\|^2$. Calculations in the text show that
$\frac{1}{2}(L_+ + L_-) < \|\xi_\pm - \theta\|^2$ when
$0 < C < 2(p - 2)$.

In order to allow for additional discussion consider the general form

$$\delta_C(Z) = \left(1 - \frac{C}{\|Z\|^2}\right) Z. \tag{8}$$

The case $C = p - 1$ is motivated by the preceding geometric argument.
But the following calculations suggest that because of the stochastic
variation modeled through $\xi_+, \xi_-$ a preferable choice is
$C = p - 2$, as in the ordinary James-Stein estimator.

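Break down the risk into two components corresponding to the directions
determined by the coordinates $Z = (X_1, R)$. This is similar to the
suggestion in Stein (1956), remark (vii). Related calculations are
described in Efron and Morris (1971). Let $L_\pm$ denote the squared
error from an observation at one of the two equally likely points
$\xi_+, \xi_-$, respectively,

$$
\begin{aligned}
L_\pm &= \left[\left(1 - \frac{C}{\|\xi_\pm\|^2}\right)(\vartheta \pm 1) - \vartheta\right]^2
       + \left[\left(1 - \frac{C}{\|\xi_\pm\|^2}\right)\sqrt{p-1} - 0\right]^2 \\
      &= \left[\pm 1 - \frac{C}{\|\xi_\pm\|^2}(\vartheta \pm 1)\right]^2
       + \left(1 - \frac{C}{\|\xi_\pm\|^2}\right)^2 (p-1) \\
      &= L_\pm^{(1)} + L_\pm^{(2)}, \quad \text{say.}
\end{aligned}
$$

Let $R_{|\xi_\pm}(\theta, \delta_C)$ denote the conditional risk given
that $Z = \xi_+$ or $\xi_-$. Then

$$
R_{|\xi_\pm} = \tfrac{1}{2}\left(L_+^{(1)} + L_-^{(1)}\right) + \tfrac{1}{2}\left(L_+^{(2)} + L_-^{(2)}\right)
 \triangleq R^{(1)}_{|\xi_\pm} + R^{(2)}_{|\xi_\pm}, \quad \text{say.}
$$

Is $\delta_C$ better than $\delta_0$ for this conditional problem? To
examine this we look at the coordinate-wise difference in conditional
risks. For $\delta_0$ the coordinate-wise risks are 1 and $p - 1$,
respectively. Hence the coordinate-wise differences are

$$
1 - R^{(1)}_{|\xi_\pm} = \frac{1}{2}\left(\frac{2C(\vartheta+1)}{\|\xi_+\|^2} - \frac{C^2(\vartheta+1)^2}{\|\xi_+\|^4}
 - \frac{2C(\vartheta-1)}{\|\xi_-\|^2} - \frac{C^2(\vartheta-1)^2}{\|\xi_-\|^4}\right)
$$

and

$$
(p-1) - R^{(2)}_{|\xi_\pm} = \frac{p-1}{2}\left(\frac{2C}{\|\xi_+\|^2} - \frac{C^2}{\|\xi_+\|^4}
 + \frac{2C}{\|\xi_-\|^2} - \frac{C^2}{\|\xi_-\|^4}\right).
$$

In order to better interpret this expression rearrange terms so as to
write the improvement of $\delta_C$ over $\delta_0$ in this conditional
problem as

$$
\begin{aligned}
\Delta_{|\xi_\pm} &= p - R_{|\xi_\pm} \\
 &= C\vartheta\left[\frac{1}{\|\xi_+\|^2} - \frac{1}{\|\xi_-\|^2}\right]
  + Cp\left[\frac{1}{\|\xi_+\|^2} + \frac{1}{\|\xi_-\|^2}\right]
  - \frac{C^2}{2}\left[\frac{(\vartheta+1)^2 + p - 1}{\|\xi_+\|^4} + \frac{(\vartheta-1)^2 + p - 1}{\|\xi_-\|^4}\right] \\
 &= C\vartheta\left[\frac{1}{\|\xi_+\|^2} - \frac{1}{\|\xi_-\|^2}\right]
  + \left(Cp - \frac{C^2}{2}\right)\left[\frac{1}{\|\xi_+\|^2} + \frac{1}{\|\xi_-\|^2}\right]
\end{aligned}
\tag{9}
$$

since $\|\xi_\pm\|^2 = (\vartheta \pm 1)^2 + p - 1$.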

If it were so that $\|\xi_+\|^2 = \|\xi_-\|^2$ then the first major term
on the right of (9) would be $= 0$, and the difference in (9) would be
positive for any $0 < C < 2p$. In particular, for any $p \ge 2$ it
would be positive for $C = p - 1$. (It could even be positive for
$p = 1$!) This of course makes no sense as a statistical solution and
only confirms that it provides an incorrect insight to ignore that
$\|\xi_+\|^2 > \|\xi_-\|^2$.

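Now, look at (9), and take into account that
$\|\xi_+\|^2 > \|\xi_-\|^2$. Then
$\frac{1}{\|\xi_+\|^2} - \frac{1}{\|\xi_-\|^2} < 0$, and the first term
on the right of (9) is negative and partially compensates for the
remaining term which is positive when $C = p - 1$. In more detail,

$$
\frac{1}{\|\xi_+\|^2} - \frac{1}{\|\xi_-\|^2}
 = \frac{\|\xi_-\|^2 - \|\xi_+\|^2}{\|\xi_+\|^2 \|\xi_-\|^2}
 = \frac{-4\vartheta}{\|\xi_+\|^2 \|\xi_-\|^2},
$$

$$
\frac{1}{\|\xi_+\|^2} + \frac{1}{\|\xi_-\|^2}
 = \frac{\|\xi_-\|^2 + \|\xi_+\|^2}{\|\xi_+\|^2 \|\xi_-\|^2}
 = \frac{2(\vartheta^2 + p)}{\|\xi_+\|^2 \|\xi_-\|^2}.
$$

Hence the difference in conditional risks for $p \ge 2$ is

$$
\begin{aligned}
\Delta_{|\xi_\pm} &= \frac{2}{\|\xi_+\|^2 \|\xi_-\|^2}
 \left[\left(C(p-2) - \frac{C^2}{2}\right)\vartheta^2 + \left(Cp - \frac{C^2}{2}\right)p\right] \\
 &\ge \frac{2(\vartheta^2 + p)}{\|\xi_+\|^2 \|\xi_-\|^2}\left(C(p-2) - \frac{C^2}{2}\right).
\end{aligned}
\tag{10}
$$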

It follows that the difference in conditional risks is positive so long
as $0 < C < 2(p - 2)$. In particular the difference is positive for
$p > 3$ and $C = p - 1$, the value motivated by the geometric argument
centered on Figure 3. On the other hand, the best choice of constant in
(10) is the slightly smaller value $C = p - 2$. The improvement in
risks is not as great as that suggested in the argument around Figure
3, and this can be considered as a necessary penalty due to the
randomness in $X$. In summary, the result in (10) provides a heuristic
motivation for inadmissibility to hold whenever $p \ge 3$.
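The two-point calculation can be reproduced numerically. The sketch below evaluates the average loss of the shrinkage estimator with constant $C$ at the two points $(\vartheta \pm 1, \sqrt{p-1})$, subtracts it from the loss $p$ of the usual estimator at those points, and compares the result with the closed form obtained in (10). The function names and the chosen values of $p$, $\vartheta$ and $C$ are assumptions made for illustration.

```python
import math

def two_point_loss(p, vartheta, c, sign):
    """Squared error of (1 - c/||z||^2) z at z = (vartheta + sign, sqrt(p - 1)); target (vartheta, 0)."""
    x1, r = vartheta + sign, math.sqrt(p - 1)
    factor = 1.0 - c / (x1 * x1 + r * r)
    return (factor * x1 - vartheta) ** 2 + (factor * r) ** 2

def improvement_direct(p, vartheta, c):
    """p minus the average loss over the two equally likely points."""
    avg = 0.5 * (two_point_loss(p, vartheta, c, +1) + two_point_loss(p, vartheta, c, -1))
    return p - avg

def improvement_closed_form(p, vartheta, c):
    """Closed form: 2 [ (c(p-2) - c^2/2) vartheta^2 + (cp - c^2/2) p ] / (||xi+||^2 ||xi-||^2)."""
    n_plus = (vartheta + 1) ** 2 + p - 1
    n_minus = (vartheta - 1) ** 2 + p - 1
    bracket = (c * (p - 2) - c * c / 2) * vartheta ** 2 + (c * p - c * c / 2) * p
    return 2.0 * bracket / (n_plus * n_minus)

p, vartheta = 6, 4.0
for c in (1.0, p - 2.0, p - 1.0, 2.0 * (p - 2) - 0.01):
    print(c, improvement_direct(p, vartheta, c), improvement_closed_form(p, vartheta, c))

# For p = 2 and c = p - 1 = 1 the improvement is negative at this vartheta.
print(improvement_direct(2, 4.0, 1.0))
```

The final line illustrates why the naive geometric suggestion fails in two dimensions: once the spread between the two points is accounted for, the conditional improvement can be negative.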

8. WHAT CAN BE PROVED 

Note in (10) that the three terms in the leading fraction are all
approximately equal; that is,
$\vartheta^2 + p \approx \|\xi_+\|^2 \approx \|\xi_-\|^2$. Hence the
argument leading to (10) suggests that the unconditional difference in
risks, $\Delta = R(\theta, \delta_0) - R(\theta, \delta_C)$, will be
well approximated as

$$\Delta = R(\theta, \delta_0) - R(\theta, \delta_C) \approx \frac{2}{\|\theta\|^2 + p}\left(C(p-2) - \frac{C^2}{2}\right). \tag{11}$$

The quality of this approximation improves as $\|\theta\| \to \infty$ in
the sense that

$$\Delta = \frac{2}{\|\theta\|^2 + p}\left(C(p-2) - \frac{C^2}{2}\right) + o\!\left(\frac{1}{\|\theta\|^2 + p}\right) \quad \text{as } \|\theta\| \to \infty. \tag{12}$$

The preceding arguments can be refined to prove the assertion in (12).
This is essentially the path followed by Stein in his original argument
in Stein (1956). In order to allow calculations accurate only for large
$\|\theta\|$, Stein replaced $\delta_C$ with the estimator

$$\delta_{C;a}(X) = \left(1 - \frac{C}{a + \|X\|^2}\right) X.$$

Then an exact Taylor expansion that can be considered as an elaboration
of the above calculations yields

$$R(\theta, \delta_0) - R(\theta, \delta_{C;a}) = \frac{2}{a + \|\theta\|^2}\left(C(p-2) - \frac{C^2}{2}\right) + o\!\left(\frac{1}{a + \|\theta\|^2}\right) \tag{13}$$

uniformly in $\|\theta\|$. It follows that $\delta_0$ is inadmissible.

The argument in Stein (1956) for (13) involves only low-order moments
of $X - \theta$. Hence it can be generalized from the normal
distribution setting to apply to more general location parameter
problems. It can also be adapted to apply (with modifications) to
problems in which the loss function is not squared error. Such
generalizations appear in Brown (1966).

When one considers only the normal distribution setting, then

$$\Delta = R(\theta, \delta_0) - R(\theta, \delta_C) = 2\,E_\theta\!\left(\frac{1}{\|X\|^2}\right)\left(C(p-2) - \frac{C^2}{2}\right). \tag{14}$$

This result is proved but not explicitly stated in 
James and Stein (1961). It is explicitly stated and 
proved using the unbiased estimate of the risk in 
Stein (1973, 1981). 

Note that

$$E_\theta\!\left(\frac{1}{\|X\|^2}\right) \approx \frac{1}{E_\theta(\|X\|^2)} = \frac{1}{\|\theta\|^2 + p}, \tag{15}$$

with the approximation being quite close except when $\|\theta\|$ is
small. Hence the heuristic approximation in (10) and (11) is quite
close to the truth. This validates the heuristic idea to approximate
the unconditional difference in risks by the conditional difference
given $Z = \xi_+, \xi_-$.
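The closeness of the approximation to the expected inverse squared length can be examined without simulation. The sketch below relies on a standard fact that is not stated in the text and is used here as an assumption: for $p \ge 3$, $\|X\|^2$ is a Poisson mixture of central chi-squared variables, so that $E_\theta(1/\|X\|^2) = E\{1/(p - 2 + 2N)\}$ with $N$ Poisson with mean $\|\theta\|^2/2$. The function names and the truncation point `terms` are illustrative choices; the truncation is adequate only for moderate $\|\theta\|$.

```python
import math

def expected_inverse_sq_norm(p, theta_norm_sq, terms=2000):
    """E(1/||X||^2) for X ~ N_p(theta, I), p >= 3, via the Poisson mixture (assumed identity)."""
    half = theta_norm_sq / 2.0
    total = 0.0
    for n in range(terms):
        if half > 0:
            # Poisson(half) weight, computed on the log scale for stability
            w = math.exp(-half + n * math.log(half) - math.lgamma(n + 1))
        else:
            w = 1.0 if n == 0 else 0.0
        total += w / (p - 2 + 2 * n)
    return total

def risk_difference_exact(p, theta_norm_sq, c):
    """2 E(1/||X||^2) (c(p-2) - c^2/2), using the series above."""
    return 2.0 * (c * (p - 2) - c * c / 2.0) * expected_inverse_sq_norm(p, theta_norm_sq)

def risk_difference_approx(p, theta_norm_sq, c):
    """Heuristic value 2 (c(p-2) - c^2/2) / (||theta||^2 + p)."""
    return 2.0 * (c * (p - 2) - c * c / 2.0) / (theta_norm_sq + p)

p = 10
c = p - 2.0
for theta_norm in (0.0, 1.0, 3.0, 10.0):
    t2 = theta_norm ** 2
    print(theta_norm, risk_difference_exact(p, t2, c), risk_difference_approx(p, t2, c))
```

The heuristic value always sits below the series value, as Jensen's inequality requires, and the two draw together as $\|\theta\|$ grows.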